Abstract. Modern high-end machines feature multiple processor packages, each
of which contains multiple independent cores and integrated memory controllers
connected directly to dedicated physical RAM. These packages are connected via
a shared bus, creating a system with a heterogeneous memory hierarchy. Since
this shared bus has less bandwidth than the sum of the links to memory,
aggregate memory bandwidth is higher when parallel threads all access memory
local to their processor package than when they access memory attached to a
remote package.

However, the impact of this heterogeneous memory architecture is not easily
understood from vendor benchmarks. Even where these measurements are available,
they provide only best-case memory throughput. This work presents a series of
modifications to the well-known STREAM benchmark to measure the effects of
NUMA on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine.

1 Introduction 

Inexpensive multicore processors and accessible multiprocessor motherboards have
brought the challenges of parallel programming with large numbers of threads and
non-uniform memory access (NUMA) into the foreground. Functional programming
languages are a particularly interesting approach to programming parallel
systems, since they provide a high-level programming model that avoids many of the
pitfalls of imperative parallel programming. But while functional languages may seem
like a better fit for parallelism, because they can compute independently while
avoiding race conditions and locality issues with shared memory mutation,
implementing a scalable parallel functional language is still challenging. Since
functional languages are value-oriented, their performance is highly dependent upon
their memory system. This system is often the major limiting factor to improved
performance in these systems [MPS09, And10].

Our group has been working on the design and implementation of a parallel
functional language to address the opportunity afforded by multicore processors. In
this paper, we describe some benchmarks we have used to measure top-end AMD and
Intel machines to assist in the design and tuning of our parallel garbage collector.

This paper makes the following contributions: 

1. We describe the architecture of and concretely measure the bandwidth and latency
due to the memory topology in both a 48-core AMD Opteron server and a 32-core
Intel Xeon server. Looking only at technical documents, it is difficult to understand
how much bandwidth is achievable from realistic programs and how the latency of
memory access changes with increased bus saturation.



1.1 AMD Hardware 

Our AMD benchmark machine is a Dell PowerEdge R815 server, outfitted with 48 cores
and 128 GB of physical memory. The 48 cores are provided by four AMD Opteron 6172
"Magny Cours" processors [Car10, CKD+10], each of which fits into a single G34 socket.
Each processor contains two nodes, and each node has six cores. The 128 GB of
physical memory is provided by thirty-two 4 GB dual-ranked RDIMMs, evenly distributed
among four sets of eight sockets, with one set for each processor. As shown in Figure 1,
these nodes, processors, and RAM chips form a hierarchy with significant differences in
available memory bandwidth and number of hops required, depending upon the source
processor core and the target physical memory location. Each six-core node (die) has a
dual-channel double data rate 3 (DDR3) memory configuration running at 1333 MHz
from its private memory controller to its own memory bank. There are two of these
nodes in each processor package. This processor topology is also laid out in Table 1.



[Figure 1 diagram: one processor package containing two six-core nodes. Each node
has a dual-channel DDR3 link to its own RAM and a x16 HT3 link to I/O. The nodes
are joined by x16 and x8 HT3 links, and x8 HT3 links lead to processors P1, P2,
and P3.]

Fig. 1. Interconnects for one processor in a quad AMD Opteron machine.



Bandwidth between each of the nodes and I/O devices is provided by four 16-bit
HyperTransport 3 (HT3) ports, each of which can be separated into two 8-bit HT3 links.
Each 8-bit HT3 link has 6.4 GB/s of bandwidth. The two nodes within a package are
connected by a full 16-bit link and an extra 8-bit link. Three 8-bit links connect
each node to the other three packages in this four-package configuration. The
remaining 16-bit link is used for I/O. Table 2 shows the bandwidth available
between the different elements in the hierarchy.

Component    Hierarchy          # Total
Processor    4 per machine      4
Node         2 per processor    8
Core         6 per node         48

Table 1. Processor topology of the AMD machine.





                           Bandwidth (GB/s)
Local Memory               21.3
Node in same package       19.2
Node on another package    6.4

Table 2. Theoretical bandwidth available between a single node (6 cores) and the rest
of an AMD Opteron 4P system.
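
The entries in Table 2 can be reproduced from the link parameters given in the
text. The sketch below is illustrative only: the function names are hypothetical,
and it assumes that each DDR3 channel is 64 bits wide (8 bytes per transfer) and
that the "1333 MHz" rating denotes 1333 million transfers per second.

```c
#include <assert.h>

/* Peak bandwidth in GB/s of a DDR3 memory configuration.
 * mega_transfers: millions of transfers per second (1333 for DDR3-1333).
 * channels: number of channels; each is ASSUMED to move 8 bytes per transfer. */
double ddr3_peak_gbs(double mega_transfers, int channels)
{
    assert(channels > 0);
    return mega_transfers * 8.0 * channels / 1000.0;
}

/* Aggregate bandwidth in GB/s of a group of HT3 links, counted in 8-bit
 * units (a 16-bit link counts as two units), at 6.4 GB/s per 8-bit link. */
double ht3_gbs(int eight_bit_units)
{
    assert(eight_bit_units >= 0);
    return 6.4 * eight_bit_units;
}
```

Under these assumptions, ddr3_peak_gbs(1333.0, 2) gives about 21.3 GB/s for local
memory, ht3_gbs(3) gives 19.2 GB/s for the 16-bit plus 8-bit connection between
nodes in the same package, and ht3_gbs(1) gives 6.4 GB/s for the single 8-bit link
to a node on another package.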



Each core operates at 2.1 GHz and has 64 KB each of instruction and data L1 cache
and 512 KB of L2 cache. Each node has 6 MB of L3 cache physically present but, by
default, 1 MB is reserved to speed up cross-node cache probes.

1.2 Intel Hardware 

The Intel benchmark machine is a QSSC-S4R server with 32 cores and 256 GB of
physical memory. The 32 cores are provided by four Intel Xeon X7560 processors
[Int, QSS]. Each processor contains 8 cores, which support two-way simultaneous
multithreading (SMT), although SMT is not enabled on our machine. This topology is
laid out in Table 3. As shown in Figure 2, these nodes, processors, and RAM chips
form a hierarchy, but this hierarchy is more uniform than that of the AMD machine.



Component    Hierarchy          # Total
Processor    4 per machine      4
Node         1 per processor    4
Core         8 per node         32

Table 3. Processor topology of the Intel machine.



Each of the nodes is connected to two memory risers, each of which has a
dual-channel DDR3 1066 MHz connection. The 4 nodes are fully connected by full-width
Intel QuickPath Interconnect (QPI) links. Table 4 shows the bandwidth available
between the different elements in the hierarchy.

Fig. 2. Interconnects for one processor in a quad Intel Xeon machine.



                Bandwidth (GB/s)
Local Memory    17.1
Other Node      25.6

Table 4. Theoretical bandwidth available between a single node (8 cores) and the rest
of an Intel Xeon system.

Each core operates at 2.266 GHz and has 32 KB each of instruction and data L1 cache
and 256 KB of L2 cache. Each node has 24 MB of L3 cache physically present but, by
default, 3 MB is reserved to speed up both cross-node and cross-core caching.



2 Measuring NUMA effects 



In Section 1.1, we described the exact hardware configuration and memory topology of
our 48-core AMD Opteron system. We also described the 32-core Intel Xeon system in
Section 1.2. These systems are the subject of the NUMA tests below.



2.1 STREAM benchmark 



The C language STREAM benchmark [McC07] consists of the four operations listed in
Table 5. These synthetic memory bandwidth tests were originally selected to measure
throughput rates for a set of common operations that had significantly different
performance characteristics on vector machines of the time. On modern hardware, each
of these tests achieves similar bandwidth, as memory is the primary constraint, not
floating-point execution. The COPY test, in particular, is representative of the type
of work performed by a copying garbage collector.



Name     Code
COPY     a[i] = b[i];
SCALE    a[i] = s*b[i];
SUM      a[i] = b[i]+c[i];
TRIAD    a[i] = b[i]+s*c[i];

Table 5. Basic operations in the STREAM benchmark.
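
Written out as complete loops, the four operations in Table 5 might look as
follows. This is a minimal sketch rather than the STREAM source itself: the
function names and signatures are hypothetical, and the timing and validation
code of the real benchmark is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* COPY: a[i] = b[i] */
void stream_copy(double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];
}

/* SCALE: a[i] = s*b[i] */
void stream_scale(double *a, const double *b, double s, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = s * b[i];
}

/* SUM: a[i] = b[i]+c[i] */
void stream_sum(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* TRIAD: a[i] = b[i]+s*c[i] */
void stream_triad(double *a, const double *b, const double *c, double s,
                  size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```

Each loop reads one or two arrays and writes one, so the arithmetic per element is
small relative to the memory traffic it generates.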



The existing STREAM benchmark does not support NUMA awareness for either
the location of the running code or the location of the allocated memory. We
modified the STREAM benchmark to measure the achievable memory bandwidth for these
operations across several allocation and access configurations. The baseline STREAM
benchmark allocates a large, static vector of double values.¹ Our modifications use
pthreads and libnuma to control the number and placement of each piece of running
code and corresponding memory [But97, Kle04].

While the STREAM benchmark's suggested array sizes for each processor are
larger than the L3 cache, the tests do not take into account cache block sizes. We
extended the tests with support for strided accesses to provide a measure of RAM
bandwidth in the case of frequent cache misses. This strided access support also
allows us to measure the latency of memory access.
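
A strided version of the COPY operation can be sketched as below. This is an
illustration of the idea, not the authors' code: the function name is
hypothetical, and the sketch assumes 64-byte cache blocks and 8-byte double
values, so that stepping by eight elements touches each cache block exactly once.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BLOCK_BYTES 64 /* ASSUMED cache block size in bytes */

/* Copy one double per cache block: elements 0, stride, 2*stride, ...
 * Returns the number of elements copied, which is also the number of
 * distinct cache blocks touched in each array. */
size_t strided_copy(double *a, const double *b, size_t n)
{
    const size_t stride = CACHE_BLOCK_BYTES / sizeof(double);
    size_t copied = 0;

    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i += stride) {
        a[i] = b[i];
        copied++;
    }
    return copied;
}
```

Because every access lands in a different cache block, no access benefits from a
block fetched by the previous one, which is what makes this variant a measure of
RAM behavior rather than cache behavior.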

2.2 Bandwidth evaluation 

Figure 3 plots the bandwidth, in MB/s, versus the number of threads. Larger bandwidth
is better. The results are for the COPY test, but all of the tests were within a small
factor of one another. Four variants of the STREAM benchmark were used:

1. Unstrided accesses memory attached to its own node and uses the baseline
STREAM strategy of sequential access through the array.

2. Unstrided+non-NUMA accesses the array sequentially, but is guaranteed to access
that memory on another package.

3. Strided also accesses local memory, but ensures that each access is to a new cache
block.

4. Strided+non-NUMA strides accesses, and also references memory from another
package.

¹ There is no difference in bandwidth when using long values.

NUMA-aware versions ensure that accessed memory is allocated on the same node as
the thread of execution and that the thread is pinned to the node, using the libnuma
library [Kle04]. To do this, the modified benchmark pins the thread to a particular node
and then uses the libnuma allocation API to guarantee that the memory is allocated
on the same node. The non-NUMA-aware versions also pin each thread to a particular
node, but then explicitly allocate memory using libnuma from an entirely separate
package (not just a separate node on the same package). When there are fewer threads
than cores, we pin threads to new nodes rather than densely packing a single node.
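
The placement policy described above can be sketched as two small helper
functions. Both names are hypothetical, and the round-robin and "next package
over" rules are one possible reading of the description, not necessarily the rules
the benchmark uses. In a libnuma program, the resulting node numbers would
presumably be handed to calls such as numa_run_on_node and numa_alloc_onnode;
those calls are left out here so that the sketch has no library dependency.

```c
#include <assert.h>

/* Sparse placement (ASSUMED round-robin): thread t runs on node
 * t mod num_nodes, so every node receives one thread before any node
 * receives a second. */
int node_for_thread(int thread, int num_nodes)
{
    assert(thread >= 0 && num_nodes > 0);
    return thread % num_nodes;
}

/* Memory node for the non-NUMA variants (ASSUMED rule): the node at the
 * same position in the next package, wrapping around. Nodes are assumed
 * to be numbered so that package = node / nodes_per_package. */
int remote_node(int node, int nodes_per_package, int num_nodes)
{
    assert(nodes_per_package > 0 && num_nodes > nodes_per_package);
    assert(num_nodes % nodes_per_package == 0);
    assert(node >= 0 && node < num_nodes);
    return (node + nodes_per_package) % num_nodes;
}
```

For the AMD machine (8 nodes, 2 per package), remote_node(0, 2, 8) is node 2, which
is on a different package than node 0. For the Intel machine (4 nodes, 1 per
package), remote_node(3, 1, 4) wraps around to node 0.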

It should not be surprising that the unstrided variants exhibit roughly eight times
the bandwidth of their strided versions, as cache blocks on these machines are 64 bytes
and the double values accessed are each 8 bytes, so each fetched block serves eight
sequential accesses but only one strided access. In the NUMA-aware cases, bandwidth
scales almost linearly up to eight threads and continues to increase up to the
maximum number of available cores on both machines. On the AMD hardware,
non-NUMA-aware code pays a significant penalty and, at 48 cores, begins to lose
bandwidth, whereas NUMA-aware code does not. On the Intel hardware, the gap between
NUMA-aware and non-NUMA-aware code is very small even when the number of threads
equals the number of cores. However, the Intel hardware does not offer as much peak
usable bandwidth for NUMA-aware code, peaking near 40,000 MB/s, whereas we achieve
nearly 55,000 MB/s on the AMD hardware.

Figure 4 plots the bandwidth against the number of threads again, but this time
divided by the number of active nodes, to provide usage data relative to the
theoretical interconnect bandwidth detailed in Table 2 for the AMD machine and
Table 4 for the Intel machine. Our benchmarks allocate threads sparsely on the nodes.
Therefore, on the AMD machine with 8 nodes, when there are fewer than 8 threads, the
number of threads is also the number of active nodes. On the Intel machine with
4 nodes, when there are fewer than 4 threads, that is the number of active nodes.
These graphs show that on both machines there is a significant gap between the
theoretical bandwidth and that achieved by the strided COPY STREAM benchmark. It is
also clear that there is a significant non-NUMA awareness penalty on the AMD machine
but that the penalty is smaller on the Intel machine. However, the Intel machine
begins to reach saturation at 32 cores, whereas NUMA-aware code on the AMD machine
continues to increase per-node bandwidth up to 48 cores.
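
The per-node normalization used for Figure 4 amounts to the following
calculation. The function names are hypothetical; the sketch assumes the sparse
placement described earlier, under which the number of active nodes is the
smaller of the thread count and the node count.

```c
#include <assert.h>

/* Number of nodes with at least one thread, assuming threads are spread
 * over new nodes before any node is given a second thread. */
int active_nodes(int threads, int num_nodes)
{
    assert(threads > 0 && num_nodes > 0);
    return threads < num_nodes ? threads : num_nodes;
}

/* Total measured bandwidth divided evenly over the active nodes.
 * The result is in the same unit as total_bandwidth (MB/s in the text). */
double per_node_bandwidth(double total_bandwidth, int threads, int num_nodes)
{
    return total_bandwidth / active_nodes(threads, num_nodes);
}
```

As an illustration with round numbers, a total of 55,000 MB/s over the 8 active
nodes of the AMD machine at 48 threads works out to 6,875 MB/s per node, and
40,000 MB/s over the 4 nodes of the Intel machine at 32 threads works out to
10,000 MB/s per node.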

2.3 Latency evaluation 

Figure 5 plots the latency times, in nanoseconds, versus the number of threads.
Smaller latency times are better. To measure the latency times, we consider only the
strided STREAM benchmarks to ensure that we are measuring only the time to access
RAM, and not the time to access cache. As was the case with bandwidth, the AMD
machine's NUMA-aware tests maintain good values up to large numbers of processors.
The non-NUMA-aware AMD benchmark begins to exhibit high latencies at moderate numbers
of threads. On the Intel machine, latency numbers remain low until more than 24 cores
are in use, and then the latencies grow similarly for both NUMA-aware and
non-NUMA-aware code.

Fig. 3. Bandwidth vs. number of threads on the STREAM benchmark, comparing
strided and NUMA configurations. Larger bandwidth is better.

Fig. 4. Bandwidth per node vs. number of threads on the STREAM benchmark,
comparing strided configurations. Larger bandwidth is better.

Fig. 5. Latency times versus the number of threads on the STREAM benchmark,
comparing strided configurations. Smaller latency times are better.



On the AMD machine, these benchmarks and evaluations clearly indicate that at
high numbers of threads of execution, poor choices of memory location or code
execution can have a significant negative impact on the latency of memory access and
on memory bandwidth. For the Intel machine, memory performance uniformly increases
until high numbers of threads, at which point it uniformly decreases, seeing little
effect from NUMA awareness. However, none of these effects show up in practice until
more than 24 threads are in use.

3 Conclusion 

We have described and measured the memory topology of two different high-end
machines using Intel and AMD processors. These measurements demonstrate that NUMA
effects exist and require engineering beyond that normally employed to achieve good
locality and cache use.

Further, we have shown that the NUMA penalty is significantly lower on Intel
systems due to the larger cross-processor bandwidth provided by QPI. However, the
AMD system provides greater total bandwidth for NUMA-aware applications.